{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Word vectors from SEC filings using gensim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section, we will learn word and phrase vectors from annual SEC filings using gensim to illustrate the potential value of word embeddings for algorithmic trading. In the following sections, we will combine these vectors as features with price returns to train neural networks to predict equity prices from the content of security filings.\n",
    "\n",
    "In particular, we use a dataset containing over 22,000 10-K annual reports from the period 2013-2016 that are filed by listed companies and contain both financial information and management commentary (see chapter 3 on Alternative Data). For about half of 11K filings for companies that we have stock prices to label the data for predictive modeling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports & Settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T22:35:17.862176Z",
     "start_time": "2018-12-08T22:35:17.757049Z"
    }
   },
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from time import time\n",
    "from collections import Counter\n",
    "import logging\n",
    "from gensim.models import Word2Vec\n",
    "from gensim.models.word2vec import LineSentence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T22:26:08.716608Z",
     "start_time": "2018-12-08T22:26:08.713845Z"
    }
   },
   "outputs": [],
   "source": [
    "pd.set_option('display.expand_frame_repr', False)\n",
    "np.random.seed(42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def format_time(t):\n",
    "    m, s = divmod(t, 60)\n",
    "    h, m = divmod(m, 60)\n",
    "    return '{:02.0f}:{:02.0f}:{:02.0f}'.format(h, m, s)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Logging Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T22:26:09.622852Z",
     "start_time": "2018-12-08T22:26:09.618313Z"
    }
   },
   "outputs": [],
   "source": [
    "logging.basicConfig(\n",
    "        filename='preprocessing.log',\n",
    "        level=logging.DEBUG,\n",
    "        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',\n",
    "        datefmt='%H:%M:%S')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Paths"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each filing is a separate text file and a master index contains filing metadata. We extract the most informative sections, namely\n",
    "- Item 1 and 1A: Business and Risk Factors\n",
    "- Item 7 and 7A: Management's Discussion and Disclosures about Market Risks\n",
    "\n",
    "The notebook preprocessing shows how to parse and tokenize the text using spaCy, similar to the approach in chapter 14. We do not lemmatize the tokens to preserve nuances of word usage.\n",
    "\n",
    "We use gensim to detect phrases. The Phrases module scores the tokens and the Phraser class transforms the text data accordingly. The notebook shows how to repeat the process to create longer phrases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T22:05:44.659946Z",
     "start_time": "2018-12-08T22:05:44.650955Z"
    }
   },
   "outputs": [],
   "source": [
    "filing_path = Path('data/filings')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sections_path = Path('data/sections')\n",
    "if not sections_path.exists():\n",
    "    sections_path.mkdir(exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Identify Sections"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for i, filing in enumerate(filing_path.glob('*.txt')):\n",
    "    if i % 500 == 0:\n",
    "        print(i, end=' ', flush=True)\n",
    "    filing_id = int(filing.stem)\n",
    "    items = {}\n",
    "    for section in filing.read_text().lower().split('°'):\n",
    "        if section.startswith('item '):\n",
    "            if len(section.split()) > 1:\n",
    "                item = section.split()[1].replace('.', '').replace(':', '').replace(',', '')\n",
    "                text = ' '.join([t for t in section.split()[2:]])\n",
    "                    if items.get(item) is None or len(items.get(item)) < len(text):\n",
    "                        items[item] = text\n",
    "\n",
    "    txt = pd.Series(items).reset_index()\n",
    "    txt.columns = ['item', 'text']\n",
    "    txt.to_csv(sections_path / (filing.stem + '.csv'), index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parse Sections"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select the following sections:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T22:15:15.102683Z",
     "start_time": "2018-12-08T22:15:15.100109Z"
    }
   },
   "outputs": [],
   "source": [
    "sections = ['1', '1a', '7', '7a']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_path = Path('data/selected_sections')\n",
    "if not clean_path.exists():\n",
    "    clean_path.mkdir(exist_ok=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nlp = spacy.load('en', disable=['ner'])\n",
    "nlp.max_length = 6000000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vocab = Counter()\n",
    "t = total_tokens = 0\n",
    "stats = []\n",
    "\n",
    "start = time()\n",
    "done = 1\n",
    "for text_file in sections_path.glob('*.csv'):\n",
    "    file_id = int(text_file.stem)\n",
    "    clean_file = clean_path / f'{file_id}.csv'\n",
    "    if clean_file.exists():\n",
    "        continue\n",
    "    items = pd.read_csv(text_file).dropna()\n",
    "    items.item = items.item.astype(str)\n",
    "    items = items[items.item.isin(sections)]\n",
    "    if done % 100 == 0:\n",
    "        duration = time() - start\n",
    "        to_go = (to_do - done) * duration / done\n",
    "        print(f'{done:>5}\\t{format_time(duration)}\\t{total_tokens / duration:,.0f}\\t{format_time(to_go)}')\n",
    "    \n",
    "    clean_doc = []\n",
    "    for _, (item, text) in items.iterrows():\n",
    "        doc = nlp(text)\n",
    "        for s, sentence in enumerate(doc.sents):\n",
    "            clean_sentence = []\n",
    "            if sentence is not None:\n",
    "                for t, token in enumerate(sentence, 1):\n",
    "                    if not any([token.is_stop,\n",
    "                                token.is_digit,\n",
    "                                not token.is_alpha,\n",
    "                                token.is_punct,\n",
    "                                token.is_space,\n",
    "                                token.lemma_ == '-PRON-',\n",
    "                                token.pos_ in ['PUNCT', 'SYM', 'X']]):\n",
    "                        clean_sentence.append(token.text.lower())\n",
    "                total_tokens += t\n",
    "                if len(clean_sentence) > 0:\n",
    "                    clean_doc.append([item, s, ' '.join(clean_sentence)])\n",
    "    (pd.DataFrame(clean_doc,\n",
    "                  columns=['item', 'sentence', 'text'])\n",
    "     .dropna()\n",
    "     .to_csv(clean_file, index=False))\n",
    "    done += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create ngrams"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T22:36:42.347622Z",
     "start_time": "2018-12-08T22:36:42.343529Z"
    }
   },
   "outputs": [],
   "source": [
    "ngram_path = Path('data', 'ngrams')\n",
    "stats_path = Path('corpus_stats')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T22:36:57.526969Z",
     "start_time": "2018-12-08T22:36:57.522768Z"
    }
   },
   "outputs": [],
   "source": [
    "def create_unigrams(min_length=3):\n",
    "    texts = []\n",
    "    sentence_counter = Counter()\n",
    "    unigrams = ngram_path / 'ngrams_1.txt'\n",
    "    vocab = Counter()\n",
    "    for f in path.glob('*.csv'):\n",
    "        df = pd.read_csv(f)\n",
    "        df.item = df.item.astype(str)\n",
    "        df = df[df.item.isin(items)]\n",
    "        sentence_counter.update(df.groupby('item').size().to_dict())\n",
    "        for sentence in df.text.str.split().tolist():\n",
    "            if len(sentence) >= min_length:\n",
    "                vocab.update(sentence)\n",
    "                texts.append(' '.join(sentence))\n",
    "    (pd.DataFrame(sentence_counter.most_common(), \n",
    "                  columns=['item', 'sentences'])\n",
    "     .to_csv(stats_path / 'selected_sentences.csv', index=False))\n",
    "    (pd.DataFrame(vocab.most_common(), columns=['token', 'n'])\n",
    "     .to_csv(stats_path / 'sections_vocab.csv', index=False))\n",
    "    unigrams.write_text('\\n'.join(texts))\n",
    "    return [l.split() for l in texts]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start = time()\n",
    "if not unigrams.exists():\n",
    "    texts = create_unigrams()\n",
    "else:\n",
    "    texts = [l.split() for l in unigrams.open()]\n",
    "print('Reading: ', format_time(time() - start))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_ngrams(max_length=3):\n",
    "    \"\"\"Using gensim to create ngrams\"\"\"\n",
    "\n",
    "    n_grams = pd.DataFrame()\n",
    "    start = time()\n",
    "    for n in range(2, max_length + 1):\n",
    "        print(n, end=' ', flush=True)\n",
    "\n",
    "        sentences = LineSentence(f'ngrams_{n - 1}.txt')\n",
    "        phrases = Phrases(sentences=sentences,\n",
    "                          min_count=25,  # ignore terms with a lower count\n",
    "                          threshold=0.5,  # accept phrases with higher score\n",
    "                          max_vocab_size=40000000,  # prune of less common words to limit memory use\n",
    "                          delimiter=b'_',  # how to join ngram tokens\n",
    "                          progress_per=50000,  # log progress every\n",
    "                          scoring='npmi')\n",
    "\n",
    "        s = pd.DataFrame([[k.decode('utf-8'), v]\n",
    "                          for k, v in phrases.export_phrases(sentences)]\n",
    "                         , columns=['phrase', 'score']).assign(length=n)\n",
    "\n",
    "        n_grams = pd.concat([n_grams, s])\n",
    "        grams = Phraser(phrases)\n",
    "        sentences = grams[sentences]\n",
    "        Path(f'ngrams_{n}.txt').write_text('\\n'.join([' '.join(s) for s in sentences]))\n",
    "\n",
    "    n_grams = n_grams.sort_values('score', ascending=False)\n",
    "    n_grams.phrase = n_grams.phrase.str.replace('_', ' ')\n",
    "    n_grams['ngram'] = n_grams.phrase.str.replace(' ', '_')\n",
    "\n",
    "    with pd.HDFStore('vocab.h5') as store:\n",
    "        store.put('ngrams', n_grams)\n",
    "\n",
    "    print('\\n\\tDuration: ', format_time(time() - start))\n",
    "    print('\\tngrams: {:,d}\\n'.format(len(n_grams)))\n",
    "    print(n_grams.groupby('length').size())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "create_ngrams()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Inspect Corpus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T23:46:12.167011Z",
     "start_time": "2018-12-08T23:46:12.054686Z"
    }
   },
   "outputs": [],
   "source": [
    "ngrams = pd.read_parquet('corpus_stats/ngrams.parquet')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T23:46:12.428814Z",
     "start_time": "2018-12-08T23:46:12.358566Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 721562 entries, 10742145 to 4887103\n",
      "Data columns (total 3 columns):\n",
      "phrase    721562 non-null object\n",
      "score     721562 non-null float64\n",
      "length    721562 non-null int64\n",
      "dtypes: float64(1), int64(1), object(1)\n",
      "memory usage: 22.0+ MB\n"
     ]
    }
   ],
   "source": [
    "ngrams.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T23:47:20.650064Z",
     "start_time": "2018-12-08T23:47:20.551220Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count    721562.000000\n",
       "mean          0.631225\n",
       "std           0.125067\n",
       "min           0.500000\n",
       "10%           0.512507\n",
       "20%           0.526746\n",
       "30%           0.543690\n",
       "40%           0.564299\n",
       "50%           0.589516\n",
       "60%           0.621228\n",
       "70%           0.663055\n",
       "80%           0.722132\n",
       "90%           0.824150\n",
       "max           1.000000\n",
       "Name: score, dtype: float64"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "percentiles=np.arange(.1, 1, .1).round(2)\n",
    "ngrams.score.describe(percentiles=percentiles)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-10T07:56:42.135744Z",
     "start_time": "2018-12-10T07:56:42.086001Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>phrase</th>\n",
       "      <th>score</th>\n",
       "      <th>length</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>13138522</th>\n",
       "      <td>topsoe uop</td>\n",
       "      <td>0.700002</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22155584</th>\n",
       "      <td>aastra prairiefyre</td>\n",
       "      <td>0.700009</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21581977</th>\n",
       "      <td>sre tre</td>\n",
       "      <td>0.700009</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9717859</th>\n",
       "      <td>twp nng</td>\n",
       "      <td>0.700017</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1507180</th>\n",
       "      <td>ecomobile telkonet</td>\n",
       "      <td>0.700017</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26474295</th>\n",
       "      <td>knsd kxas</td>\n",
       "      <td>0.700017</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17960106</th>\n",
       "      <td>oxalate ssri</td>\n",
       "      <td>0.700017</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6936430</th>\n",
       "      <td>swirl estimote</td>\n",
       "      <td>0.700017</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25398447</th>\n",
       "      <td>gdtna gdte</td>\n",
       "      <td>0.700024</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14638108</th>\n",
       "      <td>chun guang</td>\n",
       "      <td>0.700024</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                      phrase     score  length\n",
       "13138522          topsoe uop  0.700002       2\n",
       "22155584  aastra prairiefyre  0.700009       2\n",
       "21581977             sre tre  0.700009       2\n",
       "9717859              twp nng  0.700017       2\n",
       "1507180   ecomobile telkonet  0.700017       2\n",
       "26474295           knsd kxas  0.700017       2\n",
       "17960106        oxalate ssri  0.700017       2\n",
       "6936430       swirl estimote  0.700017       2\n",
       "25398447          gdtna gdte  0.700024       2\n",
       "14638108          chun guang  0.700024       2"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ngrams[ngrams.score>.7].sort_values(['length', 'score']).head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T23:49:20.481793Z",
     "start_time": "2018-12-08T23:49:20.399896Z"
    }
   },
   "outputs": [],
   "source": [
    "vocab = pd.read_csv('corpus_stats/sections_vocab.csv').dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T23:49:21.447127Z",
     "start_time": "2018-12-08T23:49:21.429999Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 201443 entries, 0 to 201444\n",
      "Data columns (total 2 columns):\n",
      "token    201443 non-null object\n",
      "n        201443 non-null int64\n",
      "dtypes: int64(1), object(1)\n",
      "memory usage: 4.6+ MB\n"
     ]
    }
   ],
   "source": [
    "vocab.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T23:49:26.121094Z",
     "start_time": "2018-12-08T23:49:26.087771Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count     201443\n",
       "mean        1440\n",
       "std        22366\n",
       "min            1\n",
       "10%            1\n",
       "20%            2\n",
       "30%            3\n",
       "40%            4\n",
       "50%            7\n",
       "60%           12\n",
       "70%           24\n",
       "80%           61\n",
       "90%          260\n",
       "max      2576751\n",
       "Name: n, dtype: int64"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vocab.n.describe(percentiles).astype(int)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T23:52:32.605872Z",
     "start_time": "2018-12-08T23:51:25.921419Z"
    }
   },
   "outputs": [],
   "source": [
    "tokens = Counter()\n",
    "for l in Path('data', 'ngrams', 'ngrams_3.txt').open():\n",
    "    tokens.update(l.split())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T23:52:33.446549Z",
     "start_time": "2018-12-08T23:52:33.151560Z"
    }
   },
   "outputs": [],
   "source": [
    "tokens = pd.DataFrame(tokens.most_common(),\n",
    "                     columns=['token', 'count'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T23:52:33.550537Z",
     "start_time": "2018-12-08T23:52:33.489729Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 664963 entries, 0 to 664962\n",
      "Data columns (total 2 columns):\n",
      "token    664963 non-null object\n",
      "count    664963 non-null int64\n",
      "dtypes: int64(1), object(1)\n",
      "memory usage: 10.1+ MB\n"
     ]
    }
   ],
   "source": [
    "tokens.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T23:52:41.859378Z",
     "start_time": "2018-12-08T23:52:41.542641Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count    546779\n",
       "mean         56\n",
       "std        1947\n",
       "min           1\n",
       "10%           1\n",
       "20%           1\n",
       "30%           2\n",
       "40%           2\n",
       "50%           3\n",
       "60%           3\n",
       "70%           4\n",
       "80%           6\n",
       "90%          13\n",
       "max      513694\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokens.loc[tokens.token.str.contains('_'), 'count'].describe(percentiles).astype(int)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-10T07:57:44.279871Z",
     "start_time": "2018-12-10T07:57:43.976999Z"
    }
   },
   "outputs": [],
   "source": [
    "tokens[tokens.token.str.contains('_')].head(20).to_csv('ngram_examples.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get returns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with pd.HDFStore('../data/assets.h5') as store:\n",
    "    stocks = store['quandl/wiki/stocks']\n",
    "    prices = store['quandl/wiki/prices'].adj_close"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sec = pd.read_csv('data/report_index.csv').rename(columns=str.lower)\n",
    "sec.date_filed = pd.to_datetime(sec.date_filed)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "idx = pd.IndexSlice"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "first = sec.date_filed.min() + relativedelta(months=-1)\n",
    "last = sec.date_filed.max() + relativedelta(months=1)\n",
    "prices = (prices\n",
    "          .loc[idx[first:last, :]]\n",
    "          .unstack().resample('D')\n",
    "          .ffill()\n",
    "          .dropna(how='all', axis=1)\n",
    "          .filter(sec.ticker.unique()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sec = sec.loc[sec.ticker.isin(prices.columns), ['ticker', 'date_filed']]\n",
    "\n",
    "price_data = []\n",
    "for ticker, date in sec.values.tolist():\n",
    "    target = date + relativedelta(months=1)\n",
    "    s = prices.loc[date: target, ticker]\n",
    "    price_data.append(s.iloc[-1] / s.iloc[0] - 1)\n",
    "\n",
    "df = pd.DataFrame(price_data,\n",
    "                  columns=['returns'],\n",
    "                  index=sec.index)\n",
    "\n",
    "print(df.returns.describe())\n",
    "sec['returns'] = price_data\n",
    "print(sec.info())\n",
    "sec.dropna().to_csv('data/sec_returns.csv', index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
